The aim of this study is to predict tomorrow’s hourly electricity prices (the market clearing price, MCP, for each hour) using predictive AR and MA models as well as regression models. The data is acquired from EPİAŞ and covers the period from 1 June 2015 to 21 June 2021. The last 30 days are used for testing the models.
Importing necessary libraries
library("readxl")
library("ggplot2")
library("tidyverse")
library("zoo")
library(plotly)
library("data.table")
library(forecast)
library(ggcorrplot)
library(stats)
library(urca)
library(lubridate)
The target variable and the explanatory variables are gathered as a CSV file from EPİAŞ:
df <- fread("mcp_with_variables.csv")
str(df)
## Classes 'data.table' and 'data.frame': 53112 obs. of 19 variables:
## $ date : IDate, format: "2015-06-01" "2015-06-01" ...
## $ hour : int 0 1 2 3 4 5 6 7 8 9 ...
## $ mcp_try : num 103.8 88 49.1 33.8 28.9 ...
## $ mcp_dollars: num 39 33.1 18.4 12.7 10.9 ...
## $ mcp_euro : num 35.61 30.19 16.83 11.59 9.93 ...
## $ load_plan : int 25600 23900 23000 22600 22500 21900 22300 25000 30200 33200 ...
## $ total_prod : num 20657 20292 19666 19650 19664 ...
## $ natural_gas: num 6198 6391 6403 6388 6300 ...
## $ wind : num 169 166 172 182 189 ...
## $ lignite : num 2799 2799 2799 2799 2799 ...
## $ black_coal : num 742 630 630 630 630 630 630 742 742 714 ...
## $ import_coal: num 3901 3708 3179 3112 3113 ...
## $ fuel_oil : num 417 417 417 417 417 417 417 417 462 462 ...
## $ geothermal : num 74.8 74.8 74.8 74.8 74.8 74.8 74.8 74.8 74.8 74.8 ...
## $ dam : num 5057 4833 4726 4824 4798 ...
## $ naphta : num 0 0 0 0 0 0 0 0 0 0 ...
## $ biomass : num 0 0 0 0 0 0 0 0 0 0 ...
## $ river : num 1159 1134 1125 1082 1203 ...
## $ other : num 140 140 140 140 140 140 140 140 140 140 ...
## - attr(*, ".internal.selfref")=<externalptr>
#dropping rows with missing values (drop_na returns a copy, so reassign)
df <- drop_na(df)
#converting back to data table
setDT(df)
#combining Date and Hour column
#df$NewDate <-ymd(df$date)+as.xts(df$hour)
df$year<-format(df$date, "%y")
df$day<-format(df$date,"%m/%d")
df$month<-format(df$date, "%m")
df[,NewDate:=ymd(date)+dhours(hour)]
There are no missing values.
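This can be verified by counting NA cells; a minimal, self-contained sketch on a toy table (on the real data the same check is `sum(is.na(df))`):

```r
# Toy two-row table with the same kinds of columns as df (hypothetical data).
toy <- data.frame(
  date    = as.Date(c("2015-06-01", "2015-06-01")),
  hour    = c(0L, 1L),
  mcp_try = c(103.8, 88.0)
)

# Total count of missing cells across all columns; 0 means no missing values.
n_missing <- sum(is.na(toy))
```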
Visualizing a sample of currency data (TL, EUR, USD) and obtaining the correlation matrix:
ts.plot(df$mcp_try[50000:52000])
ts.plot(df$mcp_dollars[50000:52000])
ts.plot(df$mcp_euro[50000:52000])
cor(df[,3:5])
## mcp_try mcp_dollars mcp_euro
## mcp_try 1.0000000 0.5509976 0.5108139
## mcp_dollars 0.5509976 1.0000000 0.9920998
## mcp_euro 0.5108139 0.9920998 1.0000000
The sample time series plots show that the MCP has similar seasonality and trend in all three currencies. However, the TRY price is only moderately correlated (approx. 0.5) with the EUR and USD prices, whereas EUR and USD are 99% correlated with each other. This can indicate that market prices are regulated (by EPİAŞ, the government, etc.) and do not move directly with sudden TRY/USD exchange-rate changes. Therefore, the TRY MCP will be used in the rest of this study.
p <- ggplot(data = df, aes(x = NewDate, y = mcp_try ,group=1))+
geom_line(color = '#007cc3') + labs(title = 'Hourly Electricity Price in Turkey between 2015-2021 (TRY)', x= "Date", y = "Hourly Electricity Price(TRY)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
#fig <- ggplotly(p)
#fig <- fig %>% layout(dragmode = "pan")
#fig
p
It can be clearly seen that the mean and variance are increasing between 2015 and 2020; the data has a positive trend and is therefore not stationary. To confirm whether the data is stationary, the KPSS unit root test is used:
KPSS_test= ur.kpss(df$mcp_try)
summary(KPSS_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 19 lags.
##
## Value of test-statistic is: 174.2189
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
plot(KPSS_test)
The null hypothesis of the KPSS test is that the data is stationary, and the test statistic exceeds the critical values by a large margin. Therefore, the null hypothesis is rejected and it can be concluded that the data is non-stationary.
acf(df$mcp_try)
The autocorrelations decay slowly, indicating that a trend exists in the data, and they rise and fall with a 24-lag period. Peaks are observed at lags 0, 24 and 48, indicating that the data has daily seasonality (a 24-hour period).
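The link between ACF peaks and a 24-hour cycle can be checked on a synthetic series (a sketch, not the EPİAŞ data): a pure daily cycle produces a positive ACF peak at lag 24 and a dip half a period away.

```r
# A pure 24-hour cycle plus noise: the ACF peaks again at lag 24.
set.seed(42)
t <- 1:1000
x <- sin(2 * pi * t / 24) + rnorm(1000, sd = 0.2)

a <- acf(x, lag.max = 48, plot = FALSE)
acf_24 <- a$acf[25]  # a$acf[k + 1] stores the autocorrelation at lag k
acf_12 <- a$acf[13]  # half a period away, the correlation turns negative
```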
A one-month sample of the time series, from 1 June 2015 to 30 June 2015:
ts.plot(df$mcp_try[1:720])
From this preliminary look, the data appears to have both weekly and daily seasonality (daily seasonality was already found above). For further understanding, decompositions at different levels (hourly, daily, weekly, monthly) are required.
df_new <- ts(df$mcp_try,frequency=24)
ts.plot(df$mcp_try[0:72])
df_additive <- decompose(df_new,type="additive")
plot(df_additive)
From the figure, the seasonality is not clearly observable, but from the 3-day figure above we know that prices are lower between 03.00 and 09.00 every day, meaning that within-day (hourly) seasonality exists.
To analyze daily trends and seasonality, the hourly data is grouped into daily data by averaging over each 24-hour period.
daily_df <- df %>% group_by(year,day) %>% summarize(DailyAvg = mean(`mcp_try`))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
df_new <- ts(daily_df$DailyAvg,frequency=7)
df_additive <- decompose(df_new,type="additive")
plot(df_additive)
The trend and seasonality are clearly observable in the figure above. The random component resembles white noise, but large deviations can be observed on special days, e.g. national holidays.
To analyze monthly trends and seasonality, the daily data is grouped into monthly data by averaging the daily values.
monthly_df <- df %>% group_by(year,month,day) %>% summarize(DailyAvg = mean(`mcp_try`))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
monthly_df <- monthly_df %>% group_by(year,month) %>% summarize(Monthly = mean(DailyAvg))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
df_new <- ts(monthly_df$Monthly,frequency=12)
df_additive <- decompose(df_new,type="additive")
plot(df_additive)
For the monthly data, the seasonality and the trend line are very clearly observable.
Since both daily and weekly seasonality exist, there is a repeating pattern every 168 hours (7 days). The AR and MA models will be built upon this pattern.
Having established the 168-hour (7-day) pattern, the time series is decomposed with this frequency:
df_new <- ts(df$mcp_try,frequency=168)
df_additive <- decompose(df_new,type="additive")
plot(df_additive)
ts.plot(df$mcp_try[1:168])
For a 7-day period from Monday to Sunday (1–7 June 2015), the electricity prices follow a similar daily pattern not only on weekdays but also at the weekend. From the decomposition, trend and seasonality exist in the data and should be eliminated before fitting the predictive models. The noise again resembles white noise, with large deviations occurring on special days.
To deseasonalize and detrend the time series, the trend and seasonal components are subtracted from the price data:
df_deseasoned_detrended <- df_new-df_additive$seasonal-df_additive$trend
p <- ggplot(data = df_deseasoned_detrended, aes(x = df$NewDate, y = df_deseasoned_detrended,group=1))+
geom_line(color = '#007cc3') + labs(title = 'Deaseasoned and Detrended Data', x= "Date", y = "Hourly Electricity Price(TRY)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
p
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
## Warning: Removed 168 row(s) containing missing values (geom_path).
The data is now much closer to stationary, with no increasing or decreasing trend. To determine appropriate AR, MA and ARMA models, the KPSS unit root test can be applied again to check stationarity (for the differencing parameter d), and the ACF and PACF plots should be investigated for the q and p parameters:
KPSS_test= ur.kpss(df_deseasoned_detrended)
summary(KPSS_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 19 lags.
##
## Value of test-statistic is: 7e-04
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The null hypothesis is not rejected this time, since the test statistic is lower than all critical values. Therefore it can be concluded that the data is now stationary.
acf(df_deseasoned_detrended,na.action=na.pass)
pacf(df_deseasoned_detrended,na.action=na.pass)
From the ACF figure, the q parameter should be between 1 and 7, or possibly 24 (not considered because of computational complexity). From the PACF figure, the p parameter should be between 1 and 3. For training, the test set (the last 30 days) is excluded from the dataset. Starting a neighborhood search with AR models:
Train-test set split:
df_train <- df_deseasoned_detrended[1:(length(df_deseasoned_detrended)-30*24)]
df_test <- df_deseasoned_detrended[(length(df_deseasoned_detrended)-30*24+1):length(df_deseasoned_detrended)]
AR Models:
model <- arima(df_train, order=c(1,0,0))
print(model)
##
## Call:
## arima(x = df_train, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.6911 0.0042
## s.e. 0.0032 0.4309
##
## sigma^2 estimated as 926.6: log likelihood = -252894.2, aic = 505794.5
AIC(model)
## [1] 505794.5
model <- arima(df_train, order=c(2,0,0))
print(model)
##
## Call:
## arima(x = df_train, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 0.6743 0.0244 0.0042
## s.e. 0.0044 0.0044 0.4417
##
## sigma^2 estimated as 926.1: log likelihood = -252878.6, aic = 505765.2
AIC(model)
## [1] 505765.2
model <- arima(df_train, order=c(3,0,0))
print(model)
##
## Call:
## arima(x = df_train, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 0.6732 -0.0044 0.0428 0.0042
## s.e. 0.0044 0.0053 0.0044 0.4609
##
## sigma^2 estimated as 924.4: log likelihood = -252830.7, aic = 505671.5
AIC(model)
## [1] 505671.5
The lowest AIC is achieved with p=3, which will be used in the ARMA model.
As mentioned before, from the ACF figure, the q parameter should be between 1 and 7 for the MA models. To limit computational complexity, q values between 1 and 5 are tried:
model <- arima(df_train, order=c(0,0,1))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 1))
##
## Coefficients:
## ma1 intercept
## 0.5679 0.0042
## s.e. 0.0030 0.2313
##
## sigma^2 estimated as 1139: log likelihood = -258286.6, aic = 516579.3
AIC(model)
## [1] 516579.3
model <- arima(df_train, order=c(0,0,2))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 2))
##
## Coefficients:
## ma1 ma2 intercept
## 0.6586 0.2932 0.0042
## s.e. 0.0042 0.0036 0.2725
##
## sigma^2 estimated as 1019: log likelihood = -255390.8, aic = 510789.6
AIC(model)
## [1] 510789.6
model <- arima(df_train, order=c(0,0,3))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## 0.6690 0.3975 0.2193 0.0042
## s.e. 0.0043 0.0043 0.0040 0.3104
##
## sigma^2 estimated as 964.9: log likelihood = -253953.8, aic = 507917.7
AIC(model)
## [1] 507917.7
model <- arima(df_train, order=c(0,0,4))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 4))
##
## Coefficients:
## ma1 ma2 ma3 ma4 intercept
## 0.6728 0.4217 0.2931 0.1456 0.0042
## s.e. 0.0044 0.0050 0.0045 0.0042 0.3403
##
## sigma^2 estimated as 943.8: log likelihood = -253373.5, aic = 506759
AIC(model)
## [1] 506759
model <- arima(df_train, order=c(0,0,5))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 5))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 intercept
## 0.6700 0.4289 0.3229 0.2050 0.1039 0.0042
## s.e. 0.0044 0.0051 0.0048 0.0048 0.0043 0.3648
##
## sigma^2 estimated as 933.4: log likelihood = -253084.5, aic = 506183
AIC(model)
## [1] 506183
q=5 gives the best (lowest) AIC. Increasing q further could do even better, but due to computational complexity, q=5 will be used from here on.
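The model-by-model neighborhood search above can also be written as a short loop over candidate orders; a sketch on simulated data so it runs standalone (on the real series, `y` would be replaced with `df_train`):

```r
set.seed(1)
# Simulated AR(2) series stands in for df_train in this sketch.
y <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 2000)

# Try every (p, q) pair in a small neighborhood and record the AIC.
grid <- expand.grid(p = 0:3, q = 0:2)
grid$aic <- mapply(function(p, q) {
  tryCatch(AIC(arima(y, order = c(p, 0, q))),
           error = function(e) Inf)  # skip orders that fail to converge
}, grid$p, grid$q)

# The order with the lowest AIC wins.
best <- grid[which.min(grid$aic), ]
```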
Combining the best orders (those with the lowest AIC) into an ARMA model:
model <- arima(df_train, order=c(3,0,5))
print(model)
##
## Call:
## arima(x = df_train, order = c(3, 0, 5))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 ma4 ma5
## 0.4562 0.6593 -0.4184 0.2159 -0.5193 0.1338 0.0999 0.0250
## s.e. 0.0569 0.0508 0.0397 0.0569 0.0587 0.0094 0.0100 0.0085
## intercept
## 0.0046
## s.e. 0.4185
##
## sigma^2 estimated as 921.6: log likelihood = -252752.4, aic = 505524.8
AIC(model)
## [1] 505524.8
BIC(model)
## [1] 505613.4
The AIC value is better (lower) than those of the models above. Checking the residuals:
checkresiduals(model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,5) with non-zero mean
## Q* = 95.872, df = 3, p-value < 2.2e-16
##
## Model df: 9. Total lags used: 12
Autocorrelation can still be observed in the residuals, most strongly at lag 24 (the same hour yesterday); however, the residuals are approximately normally distributed. A visual comparison of the prediction and the test set is useful. To add the seasonality back, I took the first 30*24 = 720 values of the seasonal component, since its pattern repeats every 168 data points. For the trend component, I took the last 720 non-missing trend values, since the decomposition does not provide trend estimates for the forecast horizon.
model_forecast <- predict(model, n.ahead = 30*24)$pred
last_trend_value <-tail(df_additive$trend[!is.na(df_additive$trend)],30*24)
seasonality <- df_additive$seasonal[1:(30*24)]
forecast_combined <-model_forecast+last_trend_value+seasonality
df$forecast <- 0
df$forecast[(length(df_deseasoned_detrended)-30*24+1):length(df_deseasoned_detrended)] <-forecast_combined
p <- ggplot()+
geom_line(data = df[52000:53112],aes(x = NewDate,y=mcp_try,color='Actual Data') ) +
geom_line(data = df[(53112-30*24+1):53112],aes(x = NewDate,y=forecast,color='Prediction')) +
scale_color_manual(values = c(
'Actual Data' = '#007cc3',
'Prediction' = 'red')) + labs(title = 'Hourly Electricity Price(TL/MWh) Prediction', x= "Date", y = "Hourly Electricity Price(TL/MWh)",color = 'Legend') +
theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
p
The model can predict the daily and hourly seasonality, but it cannot predict the deviation from the mean after about 10 days; i.e. the model is more accurate for the first 10 days, after which the hourly deviations are no longer captured. To evaluate the model and measure its overall accuracy, the daily mean absolute percentage error (MAPE) for each date and the weighted mean absolute percentage error (WMAPE) are calculated:
metrics <- df[(53112-30*24+1):53112] %>% group_by(year,day) %>% summarize(MAPE = mean(abs((mcp_try-forecast)/mcp_try)) * 100)
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
metrics
Contrary to the visual inspection of the real data and the prediction, the daily MAPE is fairly stable over the 30 days, with high MAPE values observed on 7 of the 30 days. To further improve the model, these large deviations could be analyzed. However, for the sake of the study, forecasting with regression will be done next. The WMAPE of the model:
wmape <- sum(abs(df[(53112-30*24+1):53112]$mcp_try - df[(53112-30*24+1):53112]$forecast))/sum(df[(53112-30*24+1):53112]$mcp_try) * 100
wmape
WMAPE is approximately 6%.
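For reference, the two error measures used in this study can be written as small helper functions (a sketch; this `wmape` is the standard weighted form, the sum of absolute errors over the sum of actuals, expressed in %):

```r
# Mean absolute percentage error (in %): averages per-point relative errors.
mape <- function(actual, forecast) {
  mean(abs((actual - forecast) / actual)) * 100
}

# Weighted MAPE (in %): weights errors by the actual values, so high-price
# hours dominate; equals sum of absolute errors over sum of actuals.
wmape <- function(actual, forecast) {
  sum(abs(actual - forecast)) / sum(actual) * 100
}

# Toy numbers, only to illustrate the two measures.
actual   <- c(100, 200, 400)
forecast <- c( 90, 210, 380)
mape(actual, forecast)   # mean of 10%, 5%, 5% = about 6.67
wmape(actual, forecast)  # (10+10+20)/700 = about 5.71
```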
Regression analysis may give a better model, since we have more than 10 variables related to the MCP data. After splitting into train and test sets, the data is fitted with the hour, year and month variables plus the generation variables in the data (excluding the EUR and USD prices):
#trend:
df_train <- df[1:(53112-30*24),]
df_test <- df[(53112-30*24+1):53112,]
df_train[,trend:=1:.N]
fit <- lm(mcp_try ~hour+month+load_plan+total_prod+natural_gas+wind+lignite+black_coal+import_coal+fuel_oil+geothermal+dam+naphta+biomass+river+other+year ,data=df_train)
summary(fit)
##
## Call:
## lm(formula = mcp_try ~ hour + month + load_plan + total_prod +
## natural_gas + wind + lignite + black_coal + import_coal +
## fuel_oil + geothermal + dam + naphta + biomass + river +
## other + year, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -240.62 -26.44 -0.57 26.72 1680.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.003e+02 3.191e+00 -31.435 < 2e-16 ***
## hour -4.689e-01 3.812e-02 -12.301 < 2e-16 ***
## month02 -6.974e+00 1.109e+00 -6.289 3.23e-10 ***
## month03 1.038e+01 1.276e+00 8.136 4.17e-16 ***
## month04 2.281e+01 1.571e+00 14.519 < 2e-16 ***
## month05 2.809e+01 1.633e+00 17.200 < 2e-16 ***
## month06 7.039e+00 1.375e+00 5.118 3.09e-07 ***
## month07 -1.368e+01 1.273e+00 -10.746 < 2e-16 ***
## month08 4.300e+00 1.263e+00 3.404 0.000664 ***
## month09 2.695e+01 1.280e+00 21.046 < 2e-16 ***
## month10 3.945e+01 1.302e+00 30.295 < 2e-16 ***
## month11 2.815e+01 1.427e+00 19.731 < 2e-16 ***
## month12 2.366e+01 1.430e+00 16.539 < 2e-16 ***
## load_plan 1.136e-03 1.169e-04 9.717 < 2e-16 ***
## total_prod 4.342e+01 8.075e+01 0.538 0.590793
## natural_gas -4.342e+01 8.075e+01 -0.538 0.590822
## wind -4.342e+01 8.075e+01 -0.538 0.590766
## lignite -4.341e+01 8.075e+01 -0.538 0.590868
## black_coal -4.340e+01 8.075e+01 -0.537 0.590995
## import_coal -4.340e+01 8.075e+01 -0.537 0.590983
## fuel_oil -4.342e+01 8.075e+01 -0.538 0.590808
## geothermal -4.348e+01 8.075e+01 -0.538 0.590287
## dam -4.341e+01 8.075e+01 -0.538 0.590864
## naphta -4.666e+01 8.075e+01 -0.578 0.563366
## biomass -4.325e+01 8.075e+01 -0.536 0.592233
## river -4.342e+01 8.075e+01 -0.538 0.590766
## other -4.345e+01 8.075e+01 -0.538 0.590540
## year16 -4.526e-01 1.503e+00 -0.301 0.763241
## year17 2.289e+01 2.096e+00 10.919 < 2e-16 ***
## year18 5.608e+01 3.300e+00 16.995 < 2e-16 ***
## year19 6.548e+01 4.266e+00 15.348 < 2e-16 ***
## year20 8.221e+01 5.393e+00 15.242 < 2e-16 ***
## year21 1.245e+02 7.408e+00 16.809 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.09 on 52359 degrees of freedom
## Multiple R-squared: 0.712, Adjusted R-squared: 0.7119
## F-statistic: 4046 on 32 and 52359 DF, p-value: < 2.2e-16
#checkresiduals(fit)
The adjusted R-squared (0.71) is a good initial start; parameters with high p-values will be discarded in the next model. The residuals are not checked for now, because the computation freezes my computer:
fit <- lm(mcp_try ~hour+load_plan+year+month ,data=df_train)
summary(fit)
##
## Call:
## lm(formula = mcp_try ~ hour + load_plan + year + month, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -259.01 -27.57 0.54 30.25 1674.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.930e+01 1.942e+00 -45.978 < 2e-16 ***
## hour -4.159e-02 3.931e-02 -1.058 0.290
## load_plan 6.560e-03 5.704e-05 115.001 < 2e-16 ***
## year16 9.873e+00 9.570e-01 10.316 < 2e-16 ***
## year17 2.066e+01 9.705e-01 21.282 < 2e-16 ***
## year18 8.578e+01 9.749e-01 87.992 < 2e-16 ***
## year19 1.154e+02 9.734e-01 118.518 < 2e-16 ***
## year20 1.351e+02 9.710e-01 139.149 < 2e-16 ***
## year21 1.752e+02 1.328e+00 131.957 < 2e-16 ***
## month02 -5.899e+00 1.148e+00 -5.139 2.77e-07 ***
## month03 -9.233e+00 1.126e+00 -8.201 2.44e-16 ***
## month04 -1.079e+01 1.151e+00 -9.375 < 2e-16 ***
## month05 -7.003e-01 1.159e+00 -0.604 0.546
## month06 1.292e+01 1.159e+00 11.144 < 2e-16 ***
## month07 6.416e+00 1.147e+00 5.591 2.27e-08 ***
## month08 2.706e+01 1.149e+00 23.548 < 2e-16 ***
## month09 4.815e+01 1.151e+00 41.841 < 2e-16 ***
## month10 5.688e+01 1.152e+00 49.393 < 2e-16 ***
## month11 4.399e+01 1.154e+00 38.137 < 2e-16 ***
## month12 3.613e+01 1.142e+00 31.625 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.98 on 52372 degrees of freedom
## Multiple R-squared: 0.6505, Adjusted R-squared: 0.6503
## F-statistic: 5129 on 19 and 52372 DF, p-value: < 2.2e-16
#checkresiduals(fit)
Now all variables except hour are significant. We intuitively know that hour is an important variable, so I decided to leave it in the model for now. Introducing a ratio-type variable, renewable energy generation divided by total production:
df$ratio <- (df$wind+df$lignite + df$geothermal + df$biomass)/df$total_prod
df[,trend:=1:.N]
df_train <- df[1:(53112-30*24),]
df_test <- df[(53112-30*24+1):53112,]
fit <- lm(mcp_try ~hour+load_plan+year+month+ratio ,data=df_train)
summary(fit)
##
## Call:
## lm(formula = mcp_try ~ hour + load_plan + year + month + ratio,
## data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -243.71 -27.26 -0.15 29.59 1690.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.661e+00 2.436e+00 -3.966 7.31e-05 ***
## hour -1.611e-01 3.840e-02 -4.195 2.73e-05 ***
## load_plan 5.073e-03 6.254e-05 81.116 < 2e-16 ***
## year16 2.879e+01 1.002e+00 28.746 < 2e-16 ***
## year17 4.696e+01 1.073e+00 43.757 < 2e-16 ***
## year18 1.267e+02 1.235e+00 102.653 < 2e-16 ***
## year19 1.604e+02 1.285e+00 124.823 < 2e-16 ***
## year20 1.780e+02 1.256e+00 141.739 < 2e-16 ***
## year21 2.324e+02 1.699e+00 136.788 < 2e-16 ***
## month02 -4.532e+00 1.120e+00 -4.048 5.18e-05 ***
## month03 -9.644e+00 1.098e+00 -8.784 < 2e-16 ***
## month04 -1.774e+01 1.130e+00 -15.699 < 2e-16 ***
## month05 -8.323e+00 1.139e+00 -7.306 2.80e-13 ***
## month06 1.139e+01 1.131e+00 10.073 < 2e-16 ***
## month07 1.060e+01 1.122e+00 9.445 < 2e-16 ***
## month08 3.399e+01 1.128e+00 30.124 < 2e-16 ***
## month09 5.047e+01 1.123e+00 44.941 < 2e-16 ***
## month10 5.745e+01 1.123e+00 51.157 < 2e-16 ***
## month11 4.938e+01 1.130e+00 43.710 < 2e-16 ***
## month12 4.654e+01 1.132e+00 41.113 < 2e-16 ***
## ratio -2.757e+02 5.303e+00 -51.993 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.67 on 52371 degrees of freedom
## Multiple R-squared: 0.6676, Adjusted R-squared: 0.6675
## F-statistic: 5259 on 20 and 52371 DF, p-value: < 2.2e-16
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 24
##
## data: Residuals
## LM test = 35101, df = 24, p-value < 2.2e-16
All variables are significant in this model, and the ratio of renewable production to total production also improved the R-squared value. Looking at the residuals, they are evenly distributed and have zero mean. However, from the ACF, lagged variables should be introduced to increase the accuracy of the model. We already know that there is hourly, daily and weekly seasonality. Let’s add the corresponding lags to the predictive model:
df$target_lagged_1 <-shift(df$mcp_try, 1)
df$target_lagged_24 <-shift(df$mcp_try, 24)
df$target_lagged_168 <-shift(df$mcp_try, 168)
df$ratio <- (df$wind+df$lignite + df$geothermal + df$biomass)/df$total_prod
df[,trend:=1:.N]
df_train <- df[1:(53112-30*24),]
df_test <- df[(53112-30*24+1):53112,]
fit <- lm(mcp_try ~hour+load_plan+year+month+ratio+target_lagged_1+target_lagged_24+target_lagged_168 ,data=df_train)
summary(fit)
##
## Call:
## lm(formula = mcp_try ~ hour + load_plan + year + month + ratio +
## target_lagged_1 + target_lagged_24 + target_lagged_168, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -431.96 -10.17 -0.26 11.02 1278.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.129e+00 1.448e+00 3.542 0.000398 ***
## hour -4.815e-01 2.272e-02 -21.191 < 2e-16 ***
## load_plan 6.924e-04 3.993e-05 17.342 < 2e-16 ***
## year16 7.421e+00 6.009e-01 12.349 < 2e-16 ***
## year17 9.077e+00 6.503e-01 13.959 < 2e-16 ***
## year18 1.912e+01 8.104e-01 23.592 < 2e-16 ***
## year19 2.251e+01 8.818e-01 25.531 < 2e-16 ***
## year20 2.294e+01 8.973e-01 25.569 < 2e-16 ***
## year21 2.965e+01 1.199e+00 24.740 < 2e-16 ***
## month02 -1.885e+00 6.605e-01 -2.854 0.004318 **
## month03 -1.227e+00 6.478e-01 -1.894 0.058220 .
## month04 -8.854e-01 6.683e-01 -1.325 0.185251
## month05 -7.023e-01 6.723e-01 -1.045 0.296186
## month06 3.168e+00 6.720e-01 4.714 2.43e-06 ***
## month07 1.939e-01 6.622e-01 0.293 0.769697
## month08 1.814e+00 6.736e-01 2.692 0.007095 **
## month09 3.024e+00 6.811e-01 4.440 9.00e-06 ***
## month10 4.835e+00 6.854e-01 7.053 1.77e-12 ***
## month11 5.468e+00 6.813e-01 8.025 1.03e-15 ***
## month12 4.139e+00 6.817e-01 6.072 1.27e-09 ***
## ratio -1.116e+02 3.212e+00 -34.744 < 2e-16 ***
## target_lagged_1 5.938e-01 2.971e-03 199.876 < 2e-16 ***
## target_lagged_24 1.793e-01 2.747e-03 65.268 < 2e-16 ***
## target_lagged_168 1.654e-01 2.716e-03 60.905 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.45 on 52200 degrees of freedom
## (168 observations deleted due to missingness)
## Multiple R-squared: 0.8844, Adjusted R-squared: 0.8843
## F-statistic: 1.736e+04 on 23 and 52200 DF, p-value: < 2.2e-16
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 27
##
## data: Residuals
## LM test = 6334.2, df = 27, p-value < 2.2e-16
The lagged variables increased the R-squared value by approximately 0.2, and all variables are significant except some of the months. Having achieved an adjusted R-squared of 0.88, let’s visualize the forecast and compare the model with the ARMA model.
pred <- predict(fit, df_test)
model_forecast <- pred
df$forecast <- 0
df$forecast[(53112-30*24+1):53112] <-model_forecast
p <- ggplot()+
geom_line(data = df[52000:53112],aes(x = NewDate,y=mcp_try,color='Actual Data') ) +
geom_line(data = df[(53112-30*24+1):53112],aes(x = NewDate,y=forecast,color='Prediction')) +
scale_color_manual(values = c(
'Actual Data' = '#007cc3',
'Prediction' = 'red')) + labs(title = 'Hourly Electricity Price(TL/MWh) Prediction', x= "Date", y = "Hourly Electricity Price(TL/MWh)",color = 'Legend') +
theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
p
From the visualization, the model does a good job of predicting the 30 days of electricity prices. For a better understanding, the MAPE and WMAPE values should be checked:
metrics <- df[(53112-30*24+1):53112] %>% group_by(year,day) %>% summarize(MAPE = mean(abs((mcp_try-pred)/mcp_try)) * 100)
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
metrics
wmape <- sum(abs(df[(53112-30*24+1):53112]$mcp_try - df[(53112-30*24+1):53112]$forecast))/sum(df[(53112-30*24+1):53112]$mcp_try) * 100
wmape
## [1] 6.02005
WMAPE is approximately 6%, slightly worse than the best ARMA model (WMAPE approx. 5.9%). High MAPE values are observed on the same days for both models.
In conclusion, the hourly electricity price data can be predicted accurately both with ARMA models and with regression models. However, I would expect the regression model to do better, since we introduced production data, renewable energy production, etc. into the model. For future work, a more detailed regression analysis should be done, and more variables could be leveraged with feature engineering. Also, adding variables one by one to the regression model would have been better for the analysis; I added all of the variables to the first model due to time restrictions.